Conversation

eserilev
Member

@eserilev eserilev commented Aug 20, 2025

Issue Addressed

#7603

Proposed Changes

Custody backfill sync service

Similar in many ways to the current backfill service. There may be ways to unify the two services. The difficulty there is that the current backfill service tightly couples blocks and their associated blobs/data columns. Any attempts to unify the two services should be left to a separate PR in my opinion.

SyncNetworkContext

SyncNetworkContext manages custody sync data-columns-by-range requests separately from other sync RPC requests. I think this is a nice separation, considering that custody backfill is its own service.

Data column import logic

The import logic verifies KZG commitments and checks that each data column's block root matches the block root in the node's store before importing columns.
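
A minimal sketch of that import gate, for illustration only. `DataColumn`, `ImportError`, `verify_columns_for_import` and the `verify_kzg` closure are hypothetical stand-ins, not the types used in this PR, and checking the block root before the KZG check is an assumption made here because the root comparison is cheaper.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the real types; all names are assumptions.
type Hash256 = [u8; 32];
type Slot = u64;

#[derive(Debug, Clone)]
pub struct DataColumn {
    pub slot: Slot,
    pub index: u64,
    /// Root of the block this column claims to belong to.
    pub block_root: Hash256,
}

#[derive(Debug, PartialEq)]
pub enum ImportError {
    UnknownSlot(Slot),
    BlockRootMismatch { slot: Slot, index: u64 },
    InvalidKzg { slot: Slot, index: u64 },
}

/// Returns the columns only if every one of them matches the stored block
/// root for its slot and passes the supplied KZG check.
pub fn verify_columns_for_import<F>(
    columns: Vec<DataColumn>,
    stored_roots: &HashMap<Slot, Hash256>,
    verify_kzg: F,
) -> Result<Vec<DataColumn>, ImportError>
where
    F: Fn(&DataColumn) -> bool,
{
    for column in &columns {
        // The store must already know a block root for this slot.
        let expected = stored_roots
            .get(&column.slot)
            .ok_or(ImportError::UnknownSlot(column.slot))?;
        if *expected != column.block_root {
            return Err(ImportError::BlockRootMismatch {
                slot: column.slot,
                index: column.index,
            });
        }
        // Stand-in for real KZG verification, which is not sketched here.
        if !verify_kzg(column) {
            return Err(ImportError::InvalidKzg {
                slot: column.slot,
                index: column.index,
            });
        }
    }
    Ok(columns)
}
```

Nothing is written to the store unless the whole batch passes, which is one way to make sure verification always happens before import.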

New channel to send messages to SyncManager

Now external services can communicate with the SyncManager. In this PR this channel is used to trigger a custody sync. Alternatively we may be able to use the existing mpsc channel that the SyncNetworkContext uses to communicate with the SyncManager. I will spend some time reviewing this.
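
A rough sketch of the channel idea, using `std::sync::mpsc` so it stays self-contained (the real service presumably uses an async channel). The `CustodyCountChanged` variant name is taken from the diff hunk quoted later in this conversation; its `Vec<u64>` payload, the `SyncManager` fields and `poll_service_messages` are assumptions for illustration.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Messages that external services can send to the sync manager.
/// The payload type is an assumption.
#[derive(Debug, PartialEq)]
pub enum SyncServiceMessage {
    CustodyCountChanged { columns: Vec<u64> },
}

/// Hypothetical, heavily reduced sync manager.
pub struct SyncManager {
    service_rx: Receiver<SyncServiceMessage>,
    /// Columns that a custody sync should backfill.
    pub pending_columns: Vec<u64>,
}

impl SyncManager {
    /// Builds the manager together with the sender handed to other services.
    pub fn new() -> (Self, Sender<SyncServiceMessage>) {
        let (tx, rx) = channel();
        let manager = Self {
            service_rx: rx,
            pending_columns: Vec::new(),
        };
        (manager, tx)
    }

    /// Drains all queued messages without blocking and returns how many
    /// were handled.
    pub fn poll_service_messages(&mut self) -> usize {
        let mut handled = 0;
        while let Ok(message) = self.service_rx.try_recv() {
            match message {
                SyncServiceMessage::CustodyCountChanged { columns } => {
                    self.pending_columns = columns;
                }
            }
            handled += 1;
        }
        handled
    }
}
```

Any service holding a clone of the `Sender` can trigger a custody sync without depending on the rest of the sync machinery.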

TODOs

  • Test with tracing changes from Implement tracing spans for data column RPC requests and responses #7831
  • Make sure we're verifying KZG before importing columns to the store
  • Set custody sync status to pending if the CGC count changes during forward/backfill sync. Resume custody sync when it's ready
  • Restart custody sync if a node shuts down during custody sync
  • Custody backfill metrics
  • Fix a bug where custody sync has issues running again after a successful custody sync
  • Devnet-5 testing
  • Non-happy path testing
  • Update ValidatorRegistrations.epoch_validator_custody_requirements while custody syncing, or once it's completed
  • Resolve TODO comments in the PR
  • Ensure custody sync can be started automatically on CGC change (currently this is disabled)
  • Review copy-pasta inline comments
  • Remove manual custody sync backfill triggering mechanism
  • Fix notifier speedo logic

Additional notes

This needs to be thoroughly tested before being included in 8.0.0-rc.0.

@eserilev eserilev added the work-in-progress PR is a work-in-progress label Aug 20, 2025
@eserilev eserilev requested a review from jxs as a code owner August 20, 2025 05:01
@eserilev eserilev added syncing das Data Availability Sampling fulu Required for the upcoming Fulu hard fork v8.0.0-rc.0 Q3 2025 release for Fusaka on Holesky labels Aug 20, 2025
@eserilev
Member Author

eserilev commented Aug 21, 2025

I've tested one of the non-happy paths
2025-08-21_12-21

Aug 21 19:13:37.088 WARN  Some data columns are missing from the batch  epoch: Epoch(69), missing_slots: [Slot(2214), Slot(2213), Slot(2208), Slot(2211), Slot(2209), Slot(2216), Slot(2215), Slot(2212), Slot(2210), Slot(2217)] 
Aug 21 19:13:37.088 WARN  Custody backfill batch processing error       error: MissingDataColumns { missing_slots_and_data_columns: {Slot(2214): {82, 32, 61, 74, 46, 29, 124, 41, 86, 59, 88, 108, 9, 115, 11, 13, 31, 77, 89, 42, 7, 58, 76, 34, 107, 0, 113, 15, 114, 30, 17, 60, 111, 20, 109, 48, 6, 14, 62, 5, 112, 18, 93, 106, 95, 16, 37, 40, 84, 127, 126, 57, 51, 123, 70, 38, 80, 2, 96, 92, 55, 102, 23, 44, 67, 54, 68, 66, 110, 56, 81, 53, 69, 72, 100, 52, 91, 118, 122, 28, 25, 63, 39, 117, 26, 65, 87, 125, 45, 3, 97, 90, 64, 35, 104, 83, 49, 1, 98, 85, 120, 105, 75, 73, 19, 43, 24, 8, 27, 121, 21, 79, 99, 50, 47, 10, 4, 22, 119, 71, 116, 103, 101, 94, 33, 78, 36, 12}, Slot(2213): {82, 32, 61, 74, 46, 29, 124, 41, 86, 59, 88, 108, 9, 115, 11, 13, 31, 77, 89, 42, 7, 58, 76, 34, 107, 0, 113, 15, 114, 30, 17, 60, 111, 20, 109, 48, 6, 14, 62, 5, 112, 18, 93, 106, 95, 16, 37, 40, 84, 127, 126, 57, 51, 123, 70, 38, 80, 2, 96, 92, 55, 102, 23, 44, 67, 54, 68, 66, 110, 56, 81, 53, 69, 72, 100, 52, 91, 118, 122, 28, 25, 63, 39, 117, 26, 65, 87, 125, 45, 3, 97, 90, 64, 35, 104, 83, 49, 1, 98, 85, 120, 105, 75, 73, 19, 43, 24, 8, 27, 121, 21, 79, 99, 50, 47, 10, 4, 22, 119, 71, 116, 103, 101, 94, 33, 78, 36, 12}, Slot(2208): {82, 32, 61, 74, 46, 29, 124, 41, 86, 59, 88, 108, 9, 115, 11, 13, 31, 77, 89, 42, 7, 58, 76, 34, 107, 0, 113, 15, 114, 30, 17, 60, 111, 20, 109, 48, 6, 14, 62, 5, 112, 18, 93, 106, 95, 16, 37, 40, 84, 127, 126, 57, 51, 123, 70, 38, 80, 2, 96, 92, 55, 102, 23, 44, 67, 54, 68, 66, 110, 56, 81, 53, 69, 72, 100, 52, 91, 118, 122, 28, 25, 63, 39, 117, 26, 65, 87, 125, 45, 3, 97, 90, 64, 35, 104, 83, 49, 1, 98, 85, 120, 105, 75, 73, 19, 43, 24, 8, 27, 121, 21, 79, 99, 50, 47, 10, 4, 22, 119, 71, 116, 103, 101, 94, 33, 78, 36, 12}, Slot(2211): {82, 32, 61, 74, 46, 29, 124, 41, 86, 59, 88, 108, 9, 115, 11, 13, 31, 77, 89, 42, 7, 58, 76, 34, 107, 0, 113, 15, 114, 30, 17, 60, 111, 20, 109, 48, 6, 14, 62, 5, 112, 18, 93, 106, 95, 16, 37, 40, 84, 127, 126, 57, 51, 
123, 70, 38, 80, 2, 96, 92, 55, 102, 23, 44, 67, 54, 68, 66, 110, 56, 81, 53, 69, 72, 100, 52, 91, 118, 122, 28, 25, 63, 39, 117, 26, 65, 87, 125, 45, 3, 97, 90, 64, 35, 104, 83, 49, 1, 98, 85, 120, 105, 75, 73, 19, 43, 24, 8, 27, 121, 21, 79, 99, 50, 47, 10, 4, 22, 119, 71, 116, 103, 101, 94, 33, 78, 36, 12}, Slot(2209): {82, 32, 61, 74, 46, 29, 124, 41, 86, 59, 88, 108, 9, 115, 11, 13, 31, 77, 89, 42, 7, 58, 76, 34, 107, 0, 113, 15, 114, 30, 17, 60, 111, 20, 109, 48, 6, 14, 62, 5, 112, 18, 93, 106, 95, 16, 37, 40, 84, 127, 126, 57, 51, 123, 70, 38, 80, 2, 96, 92, 55, 102, 23, 44, 67, 54, 68, 66, 110, 56, 81, 53, 69, 72, 100, 52, 91, 118, 122, 28, 25, 63, 39, 117, 26, 65, 87, 125, 45, 3, 97, 90, 64, 35, 104, 83, 49, 1, 98, 85, 120, 105, 75, 73, 19, 43, 24, 8, 27, 121, 21, 79, 99, 50, 47, 10, 4, 22, 119, 71, 116, 103, 101, 94, 33, 78, 36, 12}, Slot(2216): {82, 32, 61, 74, 46, 29, 124, 41, 86, 59, 88, 108, 9, 115, 11, 13, 31, 77, 89, 42, 7, 58, 76, 34, 107, 0, 113, 15, 114, 30, 17, 60, 111, 20, 109, 48, 6, 14, 62, 5, 112, 18, 93, 106, 95, 16, 37, 40, 84, 127, 126, 57, 51, 123, 70, 38, 80, 2, 96, 92, 55, 102, 23, 44, 67, 54, 68, 66, 110, 56, 81, 53, 69, 72, 100, 52, 91, 118, 122, 28, 25, 63, 39, 117, 26, 65, 87, 125, 45, 3, 97, 90, 64, 35, 104, 83, 49, 1, 98, 85, 120, 105, 75, 73, 19, 43, 24, 8, 27, 121, 21, 79, 99, 50, 47, 10, 4, 22, 119, 71, 116, 103, 101, 94, 33, 78, 36, 12}, Slot(2215): {82, 32, 61, 74, 46, 29, 124, 41, 86, 59, 88, 108, 9, 115, 11, 13, 31, 77, 89, 42, 7, 58, 76, 34, 107, 0, 113, 15, 114, 30, 17, 60, 111, 20, 109, 48, 6, 14, 62, 5, 112, 18, 93, 106, 95, 16, 37, 40, 84, 127, 126, 57, 51, 123, 70, 38, 80, 2, 96, 92, 55, 102, 23, 44, 67, 54, 68, 66, 110, 56, 81, 53, 69, 72, 100, 52, 91, 118, 122, 28, 25, 63, 39, 117, 26, 65, 87, 125, 45, 3, 97, 90, 64, 35, 104, 83, 49, 1, 98, 85, 120, 105, 75, 73, 19, 43, 24, 8, 27, 121, 21, 79, 99, 50, 47, 10, 4, 22, 119, 71, 116, 103, 101, 94, 33, 78, 36, 12}, Slot(2212): {82, 32, 61, 74, 46, 29, 124, 41, 86, 59, 
88, 108, 9, 115, 11, 13, 31, 77, 89, 42, 7, 58, 76, 34, 107, 0, 113, 15, 114, 30, 17, 60, 111, 20, 109, 48, 6, 14, 62, 5, 112, 18, 93, 106, 95, 16, 37, 40, 84, 127, 126, 57, 51, 123, 70, 38, 80, 2, 96, 92, 55, 102, 23, 44, 67, 54, 68, 66, 110, 56, 81, 53, 69, 72, 100, 52, 91, 118, 122, 28, 25, 63, 39, 117, 26, 65, 87, 125, 45, 3, 97, 90, 64, 35, 104, 83, 49, 1, 98, 85, 120, 105, 75, 73, 19, 43, 24, 8, 27, 121, 21, 79, 99, 50, 47, 10, 4, 22, 119, 71, 116, 103, 101, 94, 33, 78, 36, 12}, Slot(2210): {82, 32, 61, 74, 46, 29, 124, 41, 86, 59, 88, 108, 9, 115, 11, 13, 31, 77, 89, 42, 7, 58, 76, 34, 107, 0, 113, 15, 114, 30, 17, 60, 111, 20, 109, 48, 6, 14, 62, 5, 112, 18, 93, 106, 95, 16, 37, 40, 84, 127, 126, 57, 51, 123, 70, 38, 80, 2, 96, 92, 55, 102, 23, 44, 67, 54, 68, 66, 110, 56, 81, 53, 69, 72, 100, 52, 91, 118, 122, 28, 25, 63, 39, 117, 26, 65, 87, 125, 45, 3, 97, 90, 64, 35, 104, 83, 49, 1, 98, 85, 120, 105, 75, 73, 19, 43, 24, 8, 27, 121, 21, 79, 99, 50, 47, 10, 4, 22, 119, 71, 116, 103, 101, 94, 33, 78, 36, 12}, Slot(2217): {82, 32, 61, 74, 46, 29, 124, 41, 86, 59, 88, 108, 9, 115, 11, 13, 31, 77, 89, 42, 7, 58, 76, 34, 107, 0, 113, 15, 114, 30, 17, 60, 111, 20, 109, 48, 6, 14, 62, 5, 112, 18, 93, 106, 95, 16, 37, 40, 84, 127, 126, 57, 51, 123, 70, 38, 80, 2, 96, 92, 55, 102, 23, 44, 67, 54, 68, 66, 110, 56, 81, 53, 69, 72, 100, 52, 91, 118, 122, 28, 25, 63, 39, 117, 26, 65, 87, 125, 45, 3, 97, 90, 64, 35, 104, 83, 49, 1, 98, 85, 120, 105, 75, 73, 19, 43, 24, 8, 27, 121, 21, 79, 99, 50, 47, 10, 4, 22, 119, 71, 116, 103, 101, 94, 33, 78, 36, 12}} } 

I'm guessing that since the slot is orphaned, the import should actually succeed here (nothing would get imported). So I need to fix the error handling here a bit.

EDIT: I've fixed this using forward_block_iter
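
To illustrate the idea (not the actual fix), here is a sketch where only slots that hold a canonical block are expected to have columns, so a batch covering orphaned or empty slots reports nothing missing. `missing_columns` and its arguments are hypothetical; in the PR the canonical slots would presumably come from iterating forward over the canonical chain, and canonical blocks without blobs are ignored in this simplification.

```rust
use std::collections::{BTreeMap, BTreeSet};

type Slot = u64;
type ColumnIndex = u64;

/// Returns the columns still missing for each canonical slot.
/// Slots absent from `canonical_slots` (orphaned or empty) are never
/// reported as missing, whatever was or was not received for them.
pub fn missing_columns(
    canonical_slots: &BTreeSet<Slot>,
    expected_columns: &BTreeSet<ColumnIndex>,
    received: &BTreeMap<Slot, BTreeSet<ColumnIndex>>,
) -> BTreeMap<Slot, BTreeSet<ColumnIndex>> {
    let mut missing = BTreeMap::new();
    for slot in canonical_slots {
        let got = received.get(slot);
        // Keep every expected index that was not received for this slot.
        let absent: BTreeSet<ColumnIndex> = expected_columns
            .iter()
            .copied()
            .filter(|index| got.map_or(true, |set| !set.contains(index)))
            .collect();
        if !absent.is_empty() {
            missing.insert(*slot, absent);
        }
    }
    missing
}
```

With an empty `canonical_slots` set the result is empty, which corresponds to the "nothing would get imported, but the batch still succeeds" case described above.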

None
}
}
None => None, // If no DA boundary set, dont try to custody backfill
Member Author

remove comment

@jimmygchen jimmygchen requested a review from dapplion September 9, 2025 08:06
@jimmygchen
Member

@dapplion would you mind reviewing this please 🙏

@dapplion
Collaborator

While this approach is correct, we can have a simpler approach without blindly copying the machinery of the current backfill sync.

Backfill sync is conceptually simpler than forwards sync and we know exactly what data we are expecting.

The only annoyance is us requesting a range of slots that happen to have zero blocks. In that case, we need the "PendingValidation" step to handle that. However, we can make Custody backfill sync lag behind the regular backfill sync and use the roots of downloaded blocks to know exactly what data we are expecting. Then the sync process is deterministic:

  • Request columns for a range of slots. We know exactly which block roots belong to those slots. If the headers of the columns don't match, instantly penalize the peer and retry.
  • Then send the columns for KZG proof and BLS signature verification. If correct, import; if incorrect, penalize the peer and retry.

That's it, no need to track participating_peers or have pending validation batches.
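
The two steps above could be sketched roughly as follows. Everything here is hypothetical: `Column`, `BatchOutcome` and `handle_response` are illustrative names, and the `valid_proofs` flag stands in for real KZG proof and BLS signature verification.

```rust
use std::collections::HashMap;

type Hash256 = [u8; 32];
type Slot = u64;
type PeerId = u32;

#[derive(Debug, Clone)]
pub struct Column {
    pub slot: Slot,
    /// Block root derived from the column's header.
    pub block_root: Hash256,
    /// Stand-in for the result of KZG and signature verification.
    pub valid_proofs: bool,
}

#[derive(Debug, PartialEq)]
pub enum BatchOutcome {
    /// Number of columns that can be imported.
    Import(usize),
    PenalizeAndRetry { peer: PeerId, reason: &'static str },
}

/// Decides what to do with a peer's response, given the block roots already
/// known from regular backfill sync.
pub fn handle_response(
    peer: PeerId,
    expected_roots: &HashMap<Slot, Hash256>,
    columns: &[Column],
) -> BatchOutcome {
    // Step 1: every column must belong to a block root we already expect.
    let headers_match = columns
        .iter()
        .all(|c| expected_roots.get(&c.slot) == Some(&c.block_root));
    if !headers_match {
        return BatchOutcome::PenalizeAndRetry {
            peer,
            reason: "unexpected block root",
        };
    }
    // Step 2: proofs and signatures must verify before anything is imported.
    if !columns.iter().all(|c| c.valid_proofs) {
        return BatchOutcome::PenalizeAndRetry {
            peer,
            reason: "invalid proof or signature",
        };
    }
    BatchOutcome::Import(columns.len())
}
```

Because the expected roots are known up front, a single peer is responsible for each failure, which is why no participating-peer tracking is needed in this shape.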

SyncServiceMessage::CustodyCountChanged { columns } => {
// Wait for the current epoch to finalize before starting custody sync
if let Err(e) = self.custody_sync.wait_for_finalization(columns) {
tracing::warn!(error = ?e, "Failed to set custody backfill state to awaiting finalization");
Collaborator

Suggested change
- tracing::warn!(error = ?e, "Failed to set custody backfill state to awaiting finalization");
+ warn!(error = ?e, "Failed to set custody backfill state to awaiting finalization");

@@ -293,8 +301,12 @@ impl<T: BeaconChainTypes> SyncNetworkContext<T> {
blocks_by_range_requests: ActiveRequests::new("blocks_by_range"),
blobs_by_range_requests: ActiveRequests::new("blobs_by_range"),
data_columns_by_range_requests: ActiveRequests::new("data_columns_by_range"),
custody_sync_data_columns_by_range_requests: ActiveRequests::new(
Collaborator

It's not necessary to introduce a new set of requests; just use data_columns_by_range_requests. You can change DataColumnsByRangeRequestId::parent_request_id into an enum that is either a ComponentsByRangeRequestId or a CustodySyncBatchRequestId.
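
Something along these lines, as a sketch only. The three ID type names come from the comment above; the enum name `DataColumnsByRangeRequester`, the struct fields and the `route` helper are assumptions made for illustration.

```rust
/// Field layout is an assumption; only the type name comes from the review.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ComponentsByRangeRequestId {
    pub id: u32,
}

/// Field layout is an assumption; only the type name comes from the review.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct CustodySyncBatchRequestId {
    pub epoch: u64,
}

/// Hypothetical enum naming who issued a data-columns-by-range request.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum DataColumnsByRangeRequester {
    ComponentsByRange(ComponentsByRangeRequestId),
    CustodySync(CustodySyncBatchRequestId),
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct DataColumnsByRangeRequestId {
    pub id: u32,
    pub parent_request_id: DataColumnsByRangeRequester,
}

/// Routes a response to the service that owns the request, so one shared
/// request map can serve both range sync and custody backfill.
pub fn route(request: &DataColumnsByRangeRequestId) -> &'static str {
    match request.parent_request_id {
        DataColumnsByRangeRequester::ComponentsByRange(_) => "range_sync",
        DataColumnsByRangeRequester::CustodySync(_) => "custody_backfill",
    }
}
```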

@@ -15,6 +15,10 @@ pub enum SyncState {
/// specified by its peers. Once completed, the node enters this sync state and attempts to
/// download all required historical blocks.
BackFillSyncing { completed: usize, remaining: usize },
/// The node is undertaking a custody backfill sync. This occurs for a node that has completed forward and
/// backfill sync and has undergone a custody count change. During custody backfill sync the node attempts
Collaborator

has completed forward and backfill sync

Is that true? Or is it just:

The sync is waiting for the earliest_available_data_column_slots epoch to finalize before starting

The comments appear inconsistent

Member Author

they both need to be true

range sync and backfill sync must finish, and earliest_available_data_column_slots epoch must be finalized before custody backfill sync can start
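
As a sketch of that combined precondition: `SyncStatus` and `can_start_custody_backfill` are hypothetical names, and treating "the epoch is finalized" as `finalized_epoch >= earliest_available_data_column_epoch` is an assumption.

```rust
/// Hypothetical snapshot of the node's sync progress.
#[derive(Debug, Clone, Copy)]
pub struct SyncStatus {
    pub range_sync_complete: bool,
    pub backfill_sync_complete: bool,
    pub finalized_epoch: u64,
    /// Epoch of the earliest available data column slot.
    pub earliest_available_data_column_epoch: u64,
}

/// Custody backfill may start only when all three conditions hold.
pub fn can_start_custody_backfill(status: &SyncStatus) -> bool {
    status.range_sync_complete
        && status.backfill_sync_complete
        && status.finalized_epoch >= status.earliest_available_data_column_epoch
}
```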

/// A custody backfill sync has completed.
Completed,
/// Too many failed attempts at backfilling. Consider it failed.
Failed,
Collaborator

With my suggestion above, backfill sync can only fail when our current peers send us invalid data too many times. We can resume syncing once a new peer joins. So is the Failed state necessary on its own?

Failed,
/// A custody sync is set to Pending if the node is undergoing range/backfill syncing.
/// It should resume syncing after the node is fully synced.
Pending,
Collaborator

I wonder if we can collapse AwaitingFinalization, Paused, Failed and Pending into a single state called Pending(reason: String) or Paused(reason: String). Then external events like new peers or finalization call the resume function, which checks that we have finalized, have peers, etc., and then resumes.
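
A rough sketch of what the collapsed state could look like. `CustodyBackfillState`, `Preconditions`, `CustodyBackfill` and the reason strings are all hypothetical; only the `Pending(reason)` idea and the `resume` entry point come from the comment above.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum CustodyBackfillState {
    Syncing,
    /// Not running; the string records why.
    Pending(String),
    Completed,
}

/// Hypothetical snapshot of everything `resume` needs to check.
#[derive(Debug, Clone, Copy)]
pub struct Preconditions {
    pub node_synced: bool,
    pub finalized: bool,
    pub has_peers: bool,
}

pub struct CustodyBackfill {
    pub state: CustodyBackfillState,
}

impl CustodyBackfill {
    /// Called on any external event (new peer, finalization, sync finished).
    /// Re-checks every precondition and either starts syncing or records
    /// the first unmet one as the new pending reason.
    pub fn resume(&mut self, pre: &Preconditions) -> &CustodyBackfillState {
        if matches!(self.state, CustodyBackfillState::Pending(_)) {
            self.state = if !pre.node_synced {
                CustodyBackfillState::Pending("range/backfill sync in progress".into())
            } else if !pre.finalized {
                CustodyBackfillState::Pending("awaiting finalization".into())
            } else if !pre.has_peers {
                CustodyBackfillState::Pending("no peers".into())
            } else {
                CustodyBackfillState::Syncing
            };
        }
        &self.state
    }
}
```

Every event funnels through the same check, so there is a single place that decides whether custody backfill can run.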


#[derive(Debug)]
/// A segment of a chain.
pub struct CustodyBatchInfo<E: EthSpec> {
Collaborator

You can generalize range_sync::batch::BatchInfo by making it generic over the data it holds on BatchState::AwaitingProcessing and re-use for custody backfill. A batch basically does:

  • Track that a download for X is ongoing
  • Handle download retries and track failed peers
  • Track that a processing for X is ongoing
  • Handle processing retries and track failed peers

Which is the same for a range sync batch and a custody batch. Do you really need to track CustodyBatchInfo::columns inside the batch? Given an epoch, you should be able to compute the columns in the network_context. Also, if CUSTODY_BACKFILL_EPOCHS_PER_BATCH is 1, we don't need to track start_slot and end_slot. This is legacy code from before Deneb, when each batch held more than one epoch.
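
A simplified sketch of a batch that is generic over the data held while awaiting processing. This is not the existing `range_sync::batch::BatchInfo`: the states, method names and error strings are assumptions, and peer tracking and processing retries are left out to keep it short.

```rust
/// Batch state, generic over the downloaded data `D`.
#[derive(Debug, PartialEq)]
pub enum BatchState<D> {
    AwaitingDownload,
    Downloading,
    AwaitingProcessing(D),
    Processing,
    Failed,
}

/// One batch covers exactly one epoch, so no start/end slot is stored.
pub struct BatchInfo<D> {
    pub epoch: u64,
    state: BatchState<D>,
    failed_download_attempts: u8,
    max_download_attempts: u8,
}

impl<D> BatchInfo<D> {
    pub fn new(epoch: u64, max_download_attempts: u8) -> Self {
        Self {
            epoch,
            state: BatchState::AwaitingDownload,
            failed_download_attempts: 0,
            max_download_attempts,
        }
    }

    pub fn state(&self) -> &BatchState<D> {
        &self.state
    }

    pub fn start_downloading(&mut self) -> Result<(), &'static str> {
        if !matches!(self.state, BatchState::AwaitingDownload) {
            return Err("batch is not awaiting download");
        }
        self.state = BatchState::Downloading;
        Ok(())
    }

    pub fn download_completed(&mut self, data: D) -> Result<(), &'static str> {
        if !matches!(self.state, BatchState::Downloading) {
            return Err("batch is not downloading");
        }
        self.state = BatchState::AwaitingProcessing(data);
        Ok(())
    }

    /// Records a failed download and either schedules a retry or gives up.
    pub fn download_failed(&mut self) {
        if matches!(self.state, BatchState::Downloading) {
            self.failed_download_attempts += 1;
            self.state = if self.failed_download_attempts >= self.max_download_attempts {
                BatchState::Failed
            } else {
                BatchState::AwaitingDownload
            };
        }
    }

    /// Hands the downloaded data to the processor.
    pub fn start_processing(&mut self) -> Result<D, &'static str> {
        match std::mem::replace(&mut self.state, BatchState::Processing) {
            BatchState::AwaitingProcessing(data) => Ok(data),
            other => {
                // Restore the previous state if there was nothing to process.
                self.state = other;
                Err("batch is not awaiting processing")
            }
        }
    }
}
```

Range sync could instantiate this with its block data and custody backfill with a list of data columns, sharing the retry bookkeeping.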

Labels
das Data Availability Sampling fulu Required for the upcoming Fulu hard fork syncing v8.0.0-rc.0 Q3 2025 release for Fusaka on Holesky waiting-on-author The reviewer has suggested changes and awaits their implementation.